Multivariate data exploration

GEOG 30323

September 27, 2016

Why visualize data?

The greatest value of a picture is when it forces us to notice what we never expected to see.

  • Tukey (1977) quoted in Yau (2013)

Exploring data visually

Source: Yau, Data Points p. 137

Our schedule:

  • Current activities: data exploration through visualization with common chart types
  • Weeks 10-15: deep dive into data visualization
    • More complex chart types
    • How to customize your seaborn plots
    • Best practices in data visualization
    • Interactive web-based graphics
    • Maps!

Exploratory chart types

  • Comparing categories: bar chart, dot plot
  • Part-to-whole: pie chart
  • Change over time: line chart
  • Connections and relationships: scatter plot

There are many, many more chart types in these categories; these are just our focus for today!

Python and the web

  • A brief aside: With Python, data on the web is at your fingertips (our topic for Week 9)
  • This week, you will get a preview:

import pandas as pd

mx_csv = 'http://personal.tcu.edu/kylewalker/mexico.csv'
mx = pd.read_csv(mx_csv)
mx.head()

Comparing categories

How about sorting our data?

mx_sorted = mx.sort_values(by = 'gdp08', ascending = False)
mx_sorted.head()

Bar charts

Source: FiveThirtyEight.com

Bar charts

  • Length or height of bars proportional to data values, allowing for comparisons between categories
  • The value axis of bar charts must start at zero!!!
  • Recommendation: sort your data values for ease of interpretation
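
To see why the zero baseline matters, here is a minimal sketch with made-up numbers. The values 35 and 40 and the helper apparent_ratio are hypothetical and are not taken from any chart in these slides. It compares how much longer one bar is drawn than another when the value axis starts at zero versus at a truncated origin.

```python
def apparent_ratio(small, large, axis_start=0.0):
    """Ratio of drawn bar lengths when the value axis begins at axis_start."""
    if axis_start >= small:
        raise ValueError("axis_start must be below the smaller value")
    return (large - axis_start) / (small - axis_start)

# Hypothetical values: the true ratio is 40 / 35, roughly 1.14
honest = apparent_ratio(35, 40)                    # value axis starts at zero
truncated = apparent_ratio(35, 40, axis_start=34)  # value axis starts at 34

print(round(honest, 2), round(truncated, 2))
```

With the truncated axis, the second bar is drawn six times as long as the first even though its value is only about 14% larger.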

Bar chart with non-zero origin

Source: Fox News via FlowingData.com

Bar charts in Python

%matplotlib inline
import seaborn as sns

# pandas plotting: one vertical bar per row, in the data's original row order
mx.plot(x = 'name', y = 'gdp08', kind = 'bar')

Bar charts in seaborn

sns.set(style = 'whitegrid')
sns.barplot(x = 'gdp08', y = 'name', data = mx_sorted)

Dot plots

Source: FiveThirtyEight.com

Dot plots

  • Can be preferable to bar charts: values are encoded by position along an axis rather than by bar length
  • As a result, a zero origin is not strictly necessary (though consider the context)
  • Sorting the data is also recommended for dot plots

Dot plots in seaborn

sns.stripplot(x = 'gdp08', y = 'name', data = mx_sorted)

Part-to-whole

  • Categories in relationship to the entire population of values
  • Examples: pie chart, waffle chart, 100% bar chart, tree map
  • The parts must sum to 100%!
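
As a rough illustration of the part-to-whole requirement, the sketch below converts raw counts to percentage shares and checks that they add up to 100. The category labels and counts are hypothetical and do not come from the Mexico dataset.

```python
import pandas as pd

# Hypothetical category counts (not from the Mexico dataset)
counts = pd.Series({'A': 30, 'B': 50, 'C': 20, 'D': 25})

# Convert raw counts to percentage shares of the total
shares = counts / counts.sum() * 100

# Part-to-whole check: the shares should add up to 100 (within rounding error)
total = shares.sum()
print(shares.round(1))
print(total)
```

If the total is not 100, the categories either overlap or leave part of the population out, and a pie chart of them would be misleading.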

Pie charts in Python

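# Select the Zacatecas row and drop the columns that are not slices of the pie
zac = mx[mx.name == 'Zacatecas'].drop(['name', 'FID', 'gdp08', 'mus09'], axis = 1).squeeze()
# squeeze() turns the one-row DataFrame into a Series; its name is used as the plot label
zac.name = 'Zacatecas'
zac.plot(kind = 'pie', figsize = (6, 6))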

Problems with pie charts

Source: Fox Chicago via FlowingData.com

Problems with pie charts

Source: Wikimedia Commons

Line charts

Source: FiveThirtyEight.com

Line charts in pandas

hs_drop = pd.read_csv('http://personal.tcu.edu/kylewalker/data/hs_drop.csv')
hs_drop.sort_values('year', inplace = True)
hs_drop.set_index('year', inplace = True)
hs_drop.plot() # pandas plotting defaults to line charts, infers x from index

Line charts in seaborn

  • Connected points available in pointplot and catplot (catplot was named factorplot before seaborn 0.9)
  • Requires long-form data! (More to come on this in the next two weeks)

hs_drop.reset_index(inplace = True)
hs_long = pd.melt(hs_drop, id_vars = 'year',
                  value_vars = ['m_rate', 'f_rate'],
                  value_name = 'percent_drop', var_name = 'gender')
# We use catplot because it gives us greater control over the axes
# kind = 'point' matches the old factorplot default; height replaces the old size argument
chart = sns.catplot(data = hs_long, x = 'year', y = 'percent_drop',
                    hue = 'gender', kind = 'point', height = 8)
chart.set_xticklabels(rotation = 45, step = 3)
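
To make the wide-to-long reshaping done by pd.melt more concrete, here is a small sketch on a toy table. The two years and the rate values are invented for illustration; only the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical wide-form table: one row per year, one column per group
wide_df = pd.DataFrame({'year': [2000, 2001],
                        'm_rate': [12.0, 11.5],
                        'f_rate': [10.0, 9.8]})

# Long form: one row per (year, group) observation
long_df = pd.melt(wide_df, id_vars = 'year',
                  value_vars = ['m_rate', 'f_rate'],
                  var_name = 'gender', value_name = 'percent_drop')

print(long_df)
```

The two rows and two rate columns of the wide table become four rows in the long table, with the old column names stored in the gender column. That is the layout seaborn expects when a column is passed to hue.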

Scatter plots

  • Question: how do the values in two columns covary?
  • Scatter plot: each observation represented by a point; position along x axis dictated by one column value; position along y axis dictated by other column value
  • Regression line: visual representation of estimated statistical relationship between X and Y
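
As a sketch of what a regression line summarizes, the snippet below fits a straight line by ordinary least squares with numpy.polyfit. The data are hypothetical and lie exactly on y = 2x + 1, so the fitted slope and intercept are easy to check. This only illustrates the idea and is not necessarily how a plotting library computes its line internally.

```python
import numpy as np

# Hypothetical data lying exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

# A degree-1 polynomial fit is an ordinary least-squares straight line
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)
```

With real data the points scatter around the line, and the slope estimates how much Y changes on average for a one-unit change in X.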

Scatter plots

Source: FiveThirtyEight.com

Scatter plots in pandas

mx.plot(x = 'mus09', y = 'pri10', kind = 'scatter')

Scatter plots in seaborn

  • Available in the lmplot and regplot functions

sns.lmplot(data = mx, x = 'mus09', y = 'pri10')

Correlation

  • Correlation coefficient: statistical measure of how two variables covary; ranges from -1 (perfect negative linear relationship) through 0 (no linear relationship) to +1 (perfect positive linear relationship)
  • In pandas: .corr(), which computes the Pearson coefficient by default
  • Beware of spurious correlations! http://tylervigen.com/spurious-correlations

mx['mus09'].corr(mx['pri10'])

0.41639990565936902 # the result
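
For intuition about what .corr() returns, the sketch below computes the Pearson coefficient by hand as the covariance divided by the product of the standard deviations, then compares it with the pandas result. The two short series are hypothetical and are not the mus09 and pri10 columns.

```python
import pandas as pd

# Hypothetical columns (not the mus09 / pri10 values)
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0, 5.0])

# Pearson r = covariance / (std of x * std of y)
manual_r = x.cov(y) / (x.std() * y.std())

# pandas computes the Pearson correlation by default
builtin_r = x.corr(y)
print(manual_r, builtin_r)
```

Both approaches should agree up to floating-point error. For these toy values the coefficient is 0.8, a fairly strong positive relationship.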